Vision Based Driving using Deep Learning

This document describes the development of a vision-based driving system for SuperTuxKart, a popular open-source racing game. First, a low-level controller is designed to serve as an auto-pilot: it accelerates, steers, and brakes the kart according to a set of predefined rules. Once the auto-pilot works, it is used to generate the data for training a vision-based driving system. Combining the controller with a learned vision model yields a driving system that navigates the game’s tracks from images alone.

I have provided a brief review of topics covered in this report:

CNN

Convolutional Neural Networks (CNNs) are widely used in computer vision tasks, particularly in image and video recognition. CNNs apply a series of convolutional filters to the input image, which allows the network to detect various features and patterns. The resulting feature maps are then passed through a non-linear activation function and downsampled using pooling layers. Finally, the resulting features are fed into fully connected layers for classification or regression tasks.
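To make the convolution, activation, and pooling sequence concrete, here is a minimal pure-Python sketch on a tiny 6x6 image. The helper names (conv2d_valid, relu, max_pool_2x2) and the hand-picked vertical-edge kernel are my own illustrative assumptions; a real CNN learns its kernels and would be built with a library such as PyTorch.

```python
def conv2d_valid(image, kernel):
    """Cross-correlate a 2D image with a 2D kernel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Zero out negative responses."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, len(feature_map[0]) - 1, 2)
        ]
        for i in range(0, len(feature_map) - 1, 2)
    ]


# Dark left half, bright right half: a single vertical edge.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
vertical_edge = [[-1, 0, 1]] * 3

response = conv2d_valid(image, vertical_edge)   # 4x4 feature map
features = max_pool_2x2(relu(response))         # 2x2 after pooling
```

The filter responds only where the brightness changes from left to right, and pooling halves the resolution while keeping that response.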

FCN

Fully Convolutional Networks (FCNs) are a type of neural network developed for semantic segmentation, where the goal is to classify each pixel in an image. Unlike traditional neural networks, FCNs replace the fully connected layers with convolutional layers, enabling the network to accept inputs of arbitrary size. The output of an FCN is a pixel-wise classification map, which can be upsampled to the original size of the input image. FCNs have achieved state-of-the-art performance in many semantic segmentation tasks.

It’s important to note that FCNs are actually a type of CNN. The main difference is that FCNs use convolutional layers exclusively, whereas traditional CNNs typically include fully connected layers for classification or regression tasks. By replacing the fully connected layers with convolutional layers, FCNs can perform pixel-wise classification, making them ideal for semantic segmentation tasks. In contrast, traditional CNNs are often used for image classification or object recognition tasks, where the goal is to classify the entire image or a region of interest within the image.
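As a rough illustration of why replacing fully connected layers with convolutions allows inputs of any size, the sketch below applies a 1x1 convolution, which is simply the same linear classifier evaluated at every pixel. The function names and the toy weights are hypothetical and chosen only for this example.

```python
def conv1x1(features, weights, bias):
    """Apply the same linear classifier at every pixel.

    features: [C][H][W] nested lists, weights: [K][C], bias: [K].
    Returns class scores shaped [K][H][W].
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                bias[k] + sum(weights[k][c] * features[c][i][j]
                              for c in range(channels))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for k in range(len(weights))
    ]


def per_pixel_argmax(scores):
    """Pick the highest-scoring class index at each pixel."""
    n_class = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_class), key=lambda k: scores[k][i][j]) for j in range(width)]
        for i in range(height)
    ]


# Two classes, one input channel: class 1 wins wherever the pixel exceeds 0.5.
weights, bias = [[0.0], [1.0]], [0.5, 0.0]

small = [[[0.2, 0.9], [0.7, 0.1]]]        # 1 channel, 2x2 image
wide = [[[0.0, 0.6, 0.4, 1.0, 0.3]]]      # 1 channel, 1x5 image

small_labels = per_pixel_argmax(conv1x1(small, weights, bias))
wide_labels = per_pixel_argmax(conv1x1(wide, weights, bias))
```

The same weights label a 2x2 image and a 1x5 image without any change, which a fully connected layer with a fixed input length could not do.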

Computer Vision

Computer vision is a subset of artificial intelligence that enables computers to interpret and analyze images and video data. It involves the development of algorithms and techniques that can extract meaningful information from visual data, such as recognizing objects, detecting patterns, and identifying faces. Computer vision has a wide range of applications, including robotics, self-driving cars, medical imaging, and security systems. With the recent advancements in deep learning, computer vision has become increasingly sophisticated and accurate, enabling machines to perform tasks that were once thought impossible.

Low-Level Controller

The low-level controller takes an aim point and the kart's current velocity as inputs. The aim point is a point on the center of the track, 15 meters away from the kart. The controller code has been tuned so that steering, acceleration, braking, and drifting are accurate enough to complete each course within a specified time.

The table below shows the time limit within which each course had to be completed:

Course Time Constraints

Course              Time limit
zengarden           50s
lighthouse          50s
hacienda            60s
snowtuxpeak         60s
cornfield_crossing  70s
scotland            70s

Input

aim_point: The aim point toward which the kart is heading

current_vel: The current velocity of the kart

Output

action: The next step that is taken for the kart to follow the given aim-point at the given current velocity

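import numpy as np
import pystk


def control(aim_point, current_vel):
    action = pystk.Action()

    # Cruise speed the controller tries to hold.
    target_vel = 27.5

    # Proportional throttle, clipped to [-1, 1].
    action.acceleration = np.clip((target_vel - current_vel) / 10, -1, 1)

    # Brake only when the aim point is far from the origin and the kart is over speed.
    distance = np.linalg.norm(aim_point)
    action.brake = bool(distance > 1.12 and current_vel > target_vel * 1.1)

    # Steer toward the aim point; a logistic gain steers harder for distant aim points.
    angle = np.arctan2(aim_point[0], -aim_point[1])
    gain = 3.8 / (1 + np.exp(-0.8 * (distance - 0.5)))
    action.steer = float(np.clip(angle * gain / np.pi, -1, 1))

    # Drift through sharp turns.
    action.drift = abs(action.steer) > 0.511

    return action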
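To see what the controller rules produce for specific inputs, here is a sketch that mirrors the same arithmetic using only the standard library. KartAction is a hypothetical stand-in for pystk.Action, and control_sketch is my own name; the constants are copied from the controller above.

```python
import math
from dataclasses import dataclass


@dataclass
class KartAction:  # hypothetical stand-in for pystk.Action
    acceleration: float
    brake: bool
    steer: float
    drift: bool


def control_sketch(aim_point, current_vel, target_vel=27.5):
    aim_x, aim_y = aim_point
    distance = math.hypot(aim_x, aim_y)

    # Proportional throttle, clipped to [-1, 1].
    acceleration = max(-1.0, min(1.0, (target_vel - current_vel) / 10))

    # Brake only when the aim point is far from the origin and the kart is over speed.
    brake = distance > 1.12 and current_vel > target_vel * 1.1

    # Angle to the aim point, scaled by a logistic gain that grows with distance.
    angle = math.atan2(aim_x, -aim_y)
    gain = 3.8 / (1 + math.exp(-0.8 * (distance - 0.5)))
    steer = max(-1.0, min(1.0, angle * gain / math.pi))

    return KartAction(acceleration, brake, steer, abs(steer) > 0.511)


straight = control_sketch((0.0, -1.0), current_vel=20.0)
right_turn = control_sketch((0.5, -0.5), current_vel=20.0)
left_turn = control_sketch((-0.5, -0.5), current_vel=20.0)
overspeed = control_sketch((1.0, 1.0), current_vel=40.0)
```

An aim point straight ahead gives zero steering, while an aim point at (0.5, -0.5) gives a steering value of roughly 0.514, just past the 0.511 threshold that switches drifting on.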

Below is a snapshot of the auto-pilot working on the snowtuxpeak course:

In this image, the red circle marks the aim point that the auto-pilot is steering the kart toward. The blue circle marks the center of the kart.

In this next image you can see the kart drifting to keep up with the aim-point again on the snowtuxpeak course:

Create Training Dataset

The auto-pilot created with the controller function above is then used to generate a training dataset for the FCN model behind our vision-based driving system.

Below is a snapshot of some of the images that were generated as training images from the auto-pilot controller:

The drive dataset will then be used to train the FCN shown below in the planner model.

Planner

The planner takes an image as input and outputs the corresponding aim point in image coordinates. Once the aim point is predicted, the controller designed above maps it to the appropriate actions.

import torch
import torch.nn.functional as F


class Planner(torch.nn.Module):
    class Block(torch.nn.Module):
        def __init__(self, n_input, n_output, kernel_size=3, stride=2):
            super().__init__()
            self.c1 = torch.nn.Conv2d(n_input, n_output, kernel_size=kernel_size, padding=kernel_size // 2,
                                      stride=stride)
            self.c2 = torch.nn.Conv2d(n_output, n_output, kernel_size=kernel_size, padding=kernel_size // 2)
            self.c3 = torch.nn.Conv2d(n_output, n_output, kernel_size=kernel_size, padding=kernel_size // 2)
            self.b1 = torch.nn.BatchNorm2d(n_output)
            self.b2 = torch.nn.BatchNorm2d(n_output)
            self.b3 = torch.nn.BatchNorm2d(n_output)
            self.skip = torch.nn.Conv2d(n_input, n_output, kernel_size=1, stride=stride)

        def forward(self, x):
            return F.relu(self.b3(self.c3(F.relu(self.b2(self.c2(F.relu(self.b1(self.c1(x)))))))) + self.skip(x))

    class UpBlock(torch.nn.Module):
        def __init__(self, n_input, n_output, kernel_size=3, stride=2):
            super().__init__()
            self.c1 = torch.nn.ConvTranspose2d(n_input, n_output, kernel_size=kernel_size, padding=kernel_size // 2,
                                               stride=stride, output_padding=1)

        def forward(self, x):
            return F.relu(self.c1(x))

    def __init__(self, layers=[16,32,64,128], n_class=2, kernel_size=3, use_skip=True):
        super().__init__()
        self.input_mean = torch.Tensor([0.2788, 0.2657, 0.2629])
        self.input_std = torch.Tensor([0.2064, 0.1944, 0.2252])

        c = 3
        self.use_skip = use_skip
        self.n_conv = len(layers)
        skip_layer_size = [3] + layers[:-1]
        for i, l in enumerate(layers):
            self.add_module('conv%d' % i, self.Block(c, l, kernel_size, 2))
            c = l
        for i, l in list(enumerate(layers))[::-1]:
            self.add_module('upconv%d' % i, self.UpBlock(c, l, kernel_size, 2))
            c = l
            if self.use_skip:
                c += skip_layer_size[i]
        self.classifier = torch.nn.Conv2d(c, n_class, 1)
        self.size = torch.nn.Conv2d(c, 2, 1)

    def forward(self, x):
        z = (x - self.input_mean[None, :, None, None].to(x.device)) / self.input_std[None, :, None, None].to(x.device)
        up_activation = []
        for i in range(self.n_conv):
            up_activation.append(z)
            z = self._modules['conv%d' % i](z)

        for i in reversed(range(self.n_conv)):
            z = self._modules['upconv%d' % i](z)
            z = z[:, :, :up_activation[i].size(2), :up_activation[i].size(3)]
            if self.use_skip:
                z = torch.cat([z, up_activation[i]], dim=1)
        return spatial_argmax(self.classifier(z)[:,0,:,:])

The code above defines an FCN that predicts the aim point for an input image. As written, spatial_argmax returns the point in normalized image coordinates, with x and y each in the range [-1, 1]; these correspond to the pixel positions x: 0..127, y: 0..95. The model has three components: the Planner class and its nested Block and UpBlock classes.

The Block class is the fundamental downsampling building block of the model. It applies three convolutions, each followed by batch normalization and a rectified linear unit (ReLU) activation, with the first convolution downsampling by the given stride. A strided 1x1 convolution of the input is added to the result as a skip connection, which makes the block residual. The forward function applies these transformations to generate the output.

The UpBlock class constructs the up-sampling blocks for the model. It applies a single transposed convolution followed by a ReLU activation; unlike Block, it does not use batch normalization. The forward function applies these transformations to generate the output.

The Planner class defines the overall architecture. It takes a list of layer sizes and builds one downsampling Block and one UpBlock per entry. When use_skip is true, the activations saved before each downsampling block are concatenated onto the matching up-sampled output. The forward function first normalizes the input using the input_mean and input_std tensors, then applies the downsampling blocks followed by the up-sampling blocks, cropping each up-sampled tensor to the size of its skip activation. Finally, a 1x1 convolution produces a heatmap whose first channel is passed to spatial_argmax(), which returns the x and y coordinates of the predicted aim point.
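The spatial sizes inside the Planner can be traced with the standard output-size formulas for convolutions and transposed convolutions (dilation 1). The sketch below assumes the 128x96 input implied by the coordinate ranges quoted earlier and the kernel size 3, stride 2, padding 1, and output_padding 1 used in Block and UpBlock; the function names are my own.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a Conv2d along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a ConvTranspose2d along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_shapes(height, width, n_levels=4):
    """Return the (H, W) sizes seen on the way down and back up."""
    down = [(height, width)]
    for _ in range(n_levels):
        h, w = down[-1]
        down.append((conv_out(h), conv_out(w)))
    up = [down[-1]]
    for _ in range(n_levels):
        h, w = up[-1]
        up.append((conv_transpose_out(h), conv_transpose_out(w)))
    return down, up


down, up = trace_shapes(96, 128)

# An odd input size comes back one pixel too large after the round trip.
odd_round_trip = conv_transpose_out(conv_out(97))
```

Each downsampling block halves the resolution and each up-sampling block doubles it, so a 96x128 image returns to 96x128. For an odd size such as 97 the round trip gives 98, which is why the forward pass crops each up-sampled tensor to the size of the saved activation before concatenating.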

Spatial Argmax

The function below is used in the planner's forward pass. It is a utility that computes a soft argmax: the probability-weighted average position of a heatmap, which lies near the highest-scoring pixel when the heatmap is sharply peaked.

def spatial_argmax(logit):
    weights = F.softmax(logit.view(logit.size(0), -1), dim=-1).view_as(logit)
    x_y_tensor = torch.stack(((weights.sum(1) * torch.linspace(-1, 1, logit.size(2)).to(logit.device)[None]).sum(1),
                        (weights.sum(2) * torch.linspace(-1, 1, logit.size(1)).to(logit.device)[None]).sum(1)), 1)
    return x_y_tensor

It takes a logit tensor of shape (B, H, W), flattens the height and width dimensions, and applies F.softmax() over all pixels so that each heatmap becomes a probability distribution. It then computes the expected x and y coordinates by weighting a grid of values from -1 to 1 along each axis with the column and row sums of that distribution. The function returns the x and y coordinates stacked into a tensor of shape (B, 2). Because this weighted average is differentiable, unlike a hard argmax, the planner can be trained end to end on the predicted coordinates. This function is used in the Planner class to obtain the predicted aim point from the output of the neural network.
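For intuition, the same computation can be written for a single heatmap in plain Python. This is only a sketch of the idea, not the batched PyTorch version used by the planner; linspace and soft_argmax_2d are names I have introduced for the example.

```python
import math


def linspace(start, stop, n):
    """Evenly spaced values from start to stop, inclusive."""
    if n == 1:
        return [start]
    return [start + (stop - start) * i / (n - 1) for i in range(n)]


def soft_argmax_2d(logit):
    """Expected (x, y) under a softmax taken over all pixels of one [H][W] heatmap."""
    height, width = len(logit), len(logit[0])
    peak = max(max(row) for row in logit)  # subtract the max for numerical stability
    exp = [[math.exp(v - peak) for v in row] for row in logit]
    total = sum(sum(row) for row in exp)
    xs, ys = linspace(-1, 1, width), linspace(-1, 1, height)
    x = sum(exp[i][j] / total * xs[j] for i in range(height) for j in range(width))
    y = sum(exp[i][j] / total * ys[i] for i in range(height) for j in range(width))
    return x, y


# A uniform heatmap averages to the center of the image.
flat = soft_argmax_2d([[0.0] * 3 for _ in range(3)])

# A sharp peak in the top-right pixel pulls the estimate to that corner.
peaked = soft_argmax_2d([[0.0, 0.0, 50.0],
                         [0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0]])
```

A flat heatmap yields the image center (0, 0), while a strongly peaked one yields a point very close to the peak, here the top-right corner (1, -1).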

Input

logit: a tensor of per-pixel scores of shape (B, H, W); in the Planner this is the first channel of the classifier output

Output

x_y_tensor: a tensor of shape (B, 2) holding the x and y coordinates of the predicted aim point, each in the range [-1, 1]
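The Planner description quotes pixel ranges (x: 0..127, y: 0..95), while spatial_argmax works on a grid from -1 to 1. A possible conversion between the two is sketched below. It assumes that -1 and 1 land exactly on the first and last pixel, matching the torch.linspace(-1, 1, ...) grid in spatial_argmax; the function names and this convention are assumptions rather than part of the original code.

```python
def to_pixel(aim_point, width=128, height=96):
    """Map normalized (x, y) in [-1, 1] to pixel coordinates."""
    x, y = aim_point
    return (x + 1) / 2 * (width - 1), (y + 1) / 2 * (height - 1)


def to_normalized(pixel, width=128, height=96):
    """Inverse of to_pixel."""
    px, py = pixel
    return 2 * px / (width - 1) - 1, 2 * py / (height - 1) - 1


top_left = to_pixel((-1.0, -1.0))
bottom_right = to_pixel((1.0, 1.0))
center = to_pixel((0.0, 0.0))
round_trip = to_normalized(to_pixel((0.25, -0.5)))
```

Under this convention the corners of the normalized square map to pixels (0, 0) and (127, 95), and the origin maps to the image center at (63.5, 47.5).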

Training

The code snippet provided below defines a training function that trains the deep learning model defined in the Planner class.

import inspect
from os import path

import torch

# load_data and dense_transforms are helpers from this project's own modules
# and must be imported alongside Planner.


def train(args):
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    model = Planner().to(device)
    if args.continue_training:
        # Resume from a checkpoint stored next to this file.
        checkpoint = path.join(path.dirname(path.abspath(__file__)), 'planner.th')
        model.load_state_dict(torch.load(checkpoint, map_location=device))

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate, weight_decay=1e-5)

    # Build the transform by evaluating args.transform against the classes in dense_transforms.
    transform = eval(args.transform, {k: v for k, v in inspect.getmembers(dense_transforms) if inspect.isclass(v)})
    train_data = load_data('drive_data', num_workers=4, transform=transform)

    loss = torch.nn.MSELoss()

    global_step = 0
    for epoch in range(args.num_epoch):
        model.train()
        for img, label in train_data:
            img, label = img.to(device), label.to(device)
            # The model returns predicted aim points, which are compared directly with the labels.
            pred = model(img)
            loss_val = loss(pred, label)

            optimizer.zero_grad()
            loss_val.backward()
            optimizer.step()
            global_step += 1

This function uses the Adam optimizer with the specified learning rate and a weight decay of 1e-5. Training data is loaded with the load_data function, which reads the drive_data directory and applies the given transformation to each sample; in this case the transformation is a random horizontal flip followed by a conversion of the data to a tensor.
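One detail of flip augmentation for this task is that the label has to be mirrored together with the image: in normalized coordinates, mirroring left to right negates the x coordinate of the aim point. The sketch below illustrates this on a tiny image. The function name is hypothetical, and the source does not show how the project's dense_transforms handles the label, so treat this as an assumption about what such a transform needs to do.

```python
def hflip_sample(image, aim_point):
    """Mirror an [H][W] image left to right and mirror the label with it."""
    flipped = [list(reversed(row)) for row in image]
    x, y = aim_point
    return flipped, (-x, y)


image = [[1, 2, 3],
         [4, 5, 6]]

# The aim point sits on the left edge (x = -1) before the flip.
flipped_image, flipped_point = hflip_sample(image, (-1.0, 0.5))

# Flipping a second time restores the original sample.
restored = hflip_sample(flipped_image, flipped_point)
```

After the flip the left-edge pixel values appear on the right edge and the aim point moves to x = 1, so the image and label remain consistent.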

Training uses the Mean Squared Error (MSE) loss between the predicted and labelled aim points. The training loop runs for a specified number of epochs; in each step a batch of data is fed to the model, and the optimizer updates the model parameters.
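As a small worked example of the loss, the sketch below computes the MSE by hand for a batch of two aim points. It follows the default behaviour of torch.nn.MSELoss, which averages the squared differences over every element; the numbers are made up for illustration.

```python
def mse_loss(predictions, labels):
    """Mean of squared differences over every coordinate in the batch."""
    squared = [
        (p - t) ** 2
        for pred, label in zip(predictions, labels)
        for p, t in zip(pred, label)
    ]
    return sum(squared) / len(squared)


predicted = [(0.1, -0.2), (0.0, 0.4)]
labelled = [(0.0, -0.2), (0.2, 0.0)]

# Squared errors are 0.01, 0.00, 0.04 and 0.16, so the mean is 0.21 / 4.
batch_loss = mse_loss(predicted, labelled)
```

The loss is zero only when every predicted coordinate matches its label, and large errors in a single coordinate dominate because the differences are squared.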

Vision Based Driving

After training the FCN on the drive dataset created using the designed controller, the model was able to run accurately and complete all courses in the allotted time.

Below is a screenshot of the planner model driving the kart in place of the auto-pilot:

In this image, there is an additional green circle, which represents the aim point predicted by the FCN. The red circle still indicates the actual aim point, and the blue circle represents the center of the kart.

Summary

In this document, we detail the creation of a vision-based driving system for SuperTuxKart, a widely used open-source racing game, using deep learning and computer vision techniques. We begin with an overview of convolutional neural networks (CNNs) and fully convolutional networks (FCNs), which are critical to understanding the development of our vision-based driving system.

Next, we delve into the creation of a low-level controller that enables the autonomous kart to move, steer, brake, and drift based on a pre-determined set of rules. This controller processes an aim point and the current velocity of the car as input, and generates the next action that the kart takes to track the given aim-point at the present velocity. The auto-pilot controller is subsequently employed to create a training dataset for the FCN model.

Lastly, we describe the development of an FCN-based planner that takes an image as input and predicts the aim-point for the low-level controller. The planner is trained on the dataset produced by the auto-pilot controller and accurately forecasts the aim-point for the kart to follow based on the current image.